Anderson–Darling Test (Goodness-of-Fit / Normality)#
The Anderson–Darling (AD) test is a goodness-of-fit test: it asks whether a sample plausibly comes from a target distribution. It is similar in spirit to Kolmogorov–Smirnov, but it puts much more weight on the tails, which is often exactly what you care about in practice.
Learning goals#
By the end you should be able to:
state the null/alternative hypotheses for the one-sample AD test
explain why AD is tail-sensitive
compute the AD statistic from first principles (NumPy-only)
interpret the result via p-values / critical values
diagnose where the mismatch happens using CDF + contribution plots
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) What the AD test is used for#
The question it answers#
“If the true data-generating distribution were F, how surprising is it to see a sample whose empirical CDF looks like this?”
AD is most commonly used as a normality test (where F is a normal CDF with mean/variance estimated from the data),
but it is more general: you can test against any continuous CDF you can evaluate.
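As a taste of that generality, here is a minimal, self-contained sketch that tests an exponential sample against an exponential CDF using the A² formula derived in Section 4. The sample, its size, and the fitted rate are illustrative choices, and a fresh generator is used so later cells are unaffected:
rng_exp = np.random.default_rng(123)  # separate generator: don't disturb the shared rng
x_exp = np.sort(rng_exp.exponential(scale=2.0, size=100))
lam = 1.0 / x_exp.mean()  # rate estimated from the sample
Fi = np.clip(1.0 - np.exp(-lam * x_exp), 1e-12, 1.0 - 1e-12)  # exponential CDF values
i = np.arange(1, x_exp.size + 1)
A2_exp = -x_exp.size - np.sum((2 * i - 1) * (np.log(Fi) + np.log1p(-Fi[::-1]))) / x_exp.size
print(f"A² vs fitted exponential: {A2_exp:.4f}")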
Typical applications#
checking normality of residuals (regression, control charts, signal processing)
validating a fitted distribution before using it for probabilities/quantiles
comparing tail behavior (outliers, risk, reliability)
When you should be careful#
AD is derived for continuous distributions; ties/discrete data need extra care.
Observations should be roughly i.i.d. (dependence can inflate false positives).
The null distribution of the statistic depends on whether distribution parameters are known or estimated.
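The last point deserves a demonstration. The sketch below (sample size, replication count, and seeds are arbitrary choices) simulates the null distribution of A² for a standard normal target twice: once plugging in the true parameters, and once re-estimating mean and standard deviation from each simulated sample. It is self-contained, using the A² formula from Section 4 and math.erf for the normal CDF:
from math import erf

def _a2_from_cdf_values(u, eps=1e-12):
    """A² computed from CDF-transformed data (the Section 4 formula)."""
    u = np.clip(np.sort(u), eps, 1.0 - eps)
    n = u.size
    i = np.arange(1, n + 1)
    return -n - np.sum((2 * i - 1) * (np.log(u) + np.log(1 - u[::-1]))) / n

Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / np.sqrt(2.0))))  # standard normal CDF
rng_null = np.random.default_rng(0)
n_sim, n_obs = 2000, 50
known = np.empty(n_sim)
estimated = np.empty(n_sim)
for b in range(n_sim):
    z = rng_null.standard_normal(n_obs)
    known[b] = _a2_from_cdf_values(Phi(z))  # true mu=0, sigma=1
    estimated[b] = _a2_from_cdf_values(Phi((z - z.mean()) / z.std()))  # estimated mu, sigma
print(f"95% null quantile, known parameters:     {np.quantile(known, 0.95):.3f}")
print(f"95% null quantile, estimated parameters: {np.quantile(estimated, 0.95):.3f}")
The two cutoffs differ substantially: using known-parameter critical values after estimating parameters would make the test far too conservative.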
2) Hypotheses and what “reject” means#
For the one-sample AD test:
H0: the observations are i.i.d. from the target distribution with CDF F(x)
H1: the data are not from F(x)
AD produces a statistic A² ≥ 0.
larger A² ⇒ larger deviation from F (especially in the tails)
you reject H0 when A² is “too large” (via a p-value or critical value)
Important: failing to reject does not prove the distribution is correct; it just means you didn’t find strong evidence against it.
3) Intuition: why AD is tail-sensitive#
Let:
F(x) be the hypothesized CDF
F_n(x) be the empirical CDF from your sample
One way to write the AD statistic is:

$$
A^2 = n \int_{-\infty}^{\infty} \frac{\left(F_n(x) - F(x)\right)^2}{F(x)\left(1 - F(x)\right)} \, dF(x)
$$

The key is the weight term:

$$
w(F) = \frac{1}{F(1 - F)}
$$

When F is near 0 or near 1 (the tails), w(F) becomes very large, so tail mismatches are amplified: for example, w(0.5) = 4 while w(0.99) ≈ 101.
F_grid = np.linspace(1e-3, 1 - 1e-3, 2000)
w = 1.0 / (F_grid * (1.0 - F_grid))
fig = go.Figure()
fig.add_trace(go.Scatter(x=F_grid, y=w, mode="lines", name="w(F)=1/(F(1-F))"))
fig.update_layout(
title="AD tail weighting: w(F) blows up near 0 and 1",
xaxis_title="F",
yaxis_title="weight w(F)",
yaxis_type="log",
)
fig.show()
4) The discrete formula (how you actually compute A²)#
Sort the data:

$$
x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}
$$

Compute CDF values under the null:

$$
F_i = F\left(x_{(i)}\right), \qquad i = 1, \dots, n
$$

Then the one-sample AD statistic is:

$$
A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1)\left[\ln F_i + \ln\left(1 - F_{n+1-i}\right)\right]
$$

Notes:
You must avoid log(0) by clipping F_i away from 0 and 1.
For the common normality version (unknown mean/variance), people often apply a small-sample correction:

$$
A^{2*} = A^2 \left(1 + \frac{0.75}{n} + \frac{2.25}{n^2}\right)
$$
Below we implement everything with NumPy only, including a normal CDF.
def _as_1d_finite(x):
x = np.asarray(x, dtype=float).ravel()
x = x[np.isfinite(x)]
return x
def erf_approx(x):
"""Vectorized erf approximation (Abramowitz & Stegun 7.1.26)."""
x = np.asarray(x, dtype=float)
sign = np.sign(x)
ax = np.abs(x)
p = 0.3275911
a1 = 0.254829592
a2 = -0.284496736
a3 = 1.421413741
a4 = -1.453152027
a5 = 1.061405429
t = 1.0 / (1.0 + p * ax)
poly = (((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t)
y = 1.0 - poly * np.exp(-(ax ** 2))
return sign * y
def norm_cdf(x, mu=0.0, sigma=1.0):
"""Normal CDF using erf approximation (NumPy-only)."""
x = np.asarray(x, dtype=float)
sigma = np.asarray(sigma, dtype=float)
if np.any(sigma <= 0):
raise ValueError("sigma must be positive.")
z = (x - mu) / (sigma * np.sqrt(2.0))
return 0.5 * (1.0 + erf_approx(z))
def ecdf(x):
"""Empirical CDF points (x_sorted, p)."""
x = _as_1d_finite(x)
xs = np.sort(x)
n = xs.size
p = np.arange(1, n + 1) / n
return xs, p
def anderson_darling_statistic(x, cdf, args=(), eps=1e-12):
"""One-sample Anderson–Darling statistic A² for a continuous CDF F."""
x = _as_1d_finite(x)
n = x.size
if n < 2:
raise ValueError("Need at least 2 finite observations.")
xs = np.sort(x)
i = np.arange(1, n + 1)
Fi = np.asarray(cdf(xs, *args), dtype=float)
Fi = np.clip(Fi, eps, 1.0 - eps)
s = np.sum((2 * i - 1) * (np.log(Fi) + np.log1p(-Fi[::-1])))
return -n - s / n
def anderson_darling_contributions(x, cdf, args=(), eps=1e-12):
"""Per-order-statistic contribution to A² (excluding the '-n' constant)."""
x = _as_1d_finite(x)
n = x.size
xs = np.sort(x)
i = np.arange(1, n + 1)
Fi = np.asarray(cdf(xs, *args), dtype=float)
Fi = np.clip(Fi, eps, 1.0 - eps)
term = (2 * i - 1) * (np.log(Fi) + np.log1p(-Fi[::-1]))
contrib = -term / n
return xs, contrib
def ad_normality(x):
"""
AD normality test pieces (NumPy-only).
Estimates mu,sigma from the sample and applies the common correction:
A²* = A² * (1 + 0.75/n + 2.25/n²)
"""
x = _as_1d_finite(x)
n = x.size
if n < 2:
raise ValueError("Need at least 2 finite observations.")
mu = x.mean()
sigma = x.std(ddof=0)
if sigma <= 0:
raise ValueError("sigma is 0; AD normality test is undefined.")
A2 = anderson_darling_statistic(x, norm_cdf, args=(mu, sigma))
corr = 1.0 + 0.75 / n + 2.25 / (n**2)
A2_star = A2 * corr
return {"n": n, "mu": mu, "sigma": sigma, "A2": A2, "A2_star": A2_star, "corr": corr}
def ad_pvalue_normal_approx(A2_star):
"""
Approximate p-value for the AD normality test (using A²*).
This is a widely used approximation (Stephens-style). For maximum accuracy,
especially in small samples, prefer a Monte Carlo p-value.
"""
A2_star = float(A2_star)
if A2_star < 0.2:
p = 1.0 - np.exp(-13.436 + 101.14 * A2_star - 223.73 * (A2_star**2))
elif A2_star < 0.34:
p = 1.0 - np.exp(-8.318 + 42.796 * A2_star - 59.938 * (A2_star**2))
elif A2_star < 0.6:
p = np.exp(0.9177 - 4.279 * A2_star - 1.38 * (A2_star**2))
else:
p = np.exp(1.2937 - 5.709 * A2_star + 0.0186 * (A2_star**2))
return float(np.clip(p, 0.0, 1.0))
def ad_null_stats_normality(n, n_sim=3000, seed=0):
"""Monte Carlo null distribution for A²* under normality (mu,sigma estimated)."""
rng = np.random.default_rng(seed)
stats = np.empty(n_sim, dtype=float)
for j in range(n_sim):
sim = rng.standard_normal(n)
stats[j] = ad_normality(sim)["A2_star"]
return stats
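If SciPy is available, you can sanity-check the implementation against scipy.stats.anderson. Expect close but not identical numbers: at the time of writing, SciPy estimates σ with ddof=1 and folds the small-sample correction into its critical values rather than into the statistic.
try:
    from scipy.stats import anderson as scipy_anderson
    rng_chk = np.random.default_rng(0)
    sample = rng_chk.standard_normal(100)
    ours = ad_normality(sample)
    theirs = scipy_anderson(sample, dist="norm")
    print(f"NumPy-only A² = {ours['A2']:.4f} (A²* = {ours['A2_star']:.4f})")
    print(f"SciPy statistic = {theirs.statistic:.4f}")
except ImportError:
    print("SciPy not installed; skipping cross-check.")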
5) Worked example: data that are actually normal#
We generate a sample from a normal distribution. Under H0, we expect:
A²* to look typical for this sample size
a p-value that is not extremely small
n = 80
x_norm = rng.normal(loc=0.0, scale=1.0, size=n)
res_norm = ad_normality(x_norm)
p_approx_norm = ad_pvalue_normal_approx(res_norm["A2_star"])
null_stats = ad_null_stats_normality(n, n_sim=4000, seed=123)
p_mc_norm = float(np.mean(null_stats >= res_norm["A2_star"]))
alphas = np.array([0.15, 0.10, 0.05, 0.025, 0.01])
crit = np.quantile(null_stats, 1 - alphas)
print("Normal sample")
print({k: (float(v) if isinstance(v, (np.floating, float)) else v) for k, v in res_norm.items() if k != "corr"})
print(f"Approx p-value (normality): {p_approx_norm:.4f}")
print(f"MC p-value (normality): {p_mc_norm:.4f}")
print()
print("Monte Carlo critical values for A²* (reject if A²* > critical):")
for a, c in zip(alphas, crit):
print(f" alpha={a:>5.3f} critical={c:.4f}")
Normal sample
{'n': 80, 'mu': 0.025721174006128944, 'sigma': 0.7695278508490445, 'A2': 0.40370944560065425, 'A2_star': 0.40763615075512927}
Approx p-value (normality): 0.3479
MC p-value (normality): 0.3688
Monte Carlo critical values for A²* (reject if A²* > critical):
alpha=0.150 critical=0.5714
alpha=0.100 critical=0.6443
alpha=0.050 critical=0.7715
alpha=0.025 critical=0.8856
alpha=0.010 critical=1.0426
# P–P plot: theoretical CDF values vs empirical probabilities
xs = np.sort(x_norm)
i = np.arange(1, n + 1)
emp_p = (i - 0.5) / n
theory_p = norm_cdf(xs, res_norm["mu"], res_norm["sigma"])
fig = go.Figure()
fig.add_trace(go.Scatter(x=theory_p, y=emp_p, mode="markers", name="data"))
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode="lines", name="y = x"))
fig.update_layout(
title="P–P plot for normality (normal sample)",
xaxis_title="Theoretical CDF F(xᵢ)",
yaxis_title="Empirical probability (i−0.5)/n",
)
fig.show()
# Contribution view: where the statistic comes from
_, contrib = anderson_darling_contributions(x_norm, norm_cdf, args=(res_norm["mu"], res_norm["sigma"]))
contrib_star = contrib * res_norm["corr"]
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=np.arange(1, n + 1),
y=contrib_star,
mode="lines+markers",
name="contribution",
)
)
fig.update_layout(
title="Where A²* comes from (normal sample)",
xaxis_title="Order statistic index i (1=smallest, n=largest)",
yaxis_title="Contribution to A²*",
)
fig.show()
fig = px.histogram(
null_stats,
nbins=60,
title=f"Null distribution of A²* under normality (n={n}, MC={len(null_stats)})",
labels={"value": "A²*"},
)
fig.add_vline(x=res_norm["A2_star"], line_color="red", line_width=3)
fig.add_annotation(
x=res_norm["A2_star"],
y=0,
yanchor="bottom",
text=f"observed A²*={res_norm['A2_star']:.3f}\nMC p≈{p_mc_norm:.3f}",
showarrow=True,
arrowhead=2,
)
fig.update_layout(showlegend=False)
fig.show()
6) Worked example: heavy tails (t distribution)#
Now we generate data with heavier tails than a normal distribution.
Because AD heavily weights the tails, it often detects this kind of deviation strongly.
x_t = rng.standard_t(df=3, size=n)
res_t = ad_normality(x_t)
p_approx_t = ad_pvalue_normal_approx(res_t["A2_star"])
p_mc_t = float(np.mean(null_stats >= res_t["A2_star"]))
print("t(3) sample (heavy tails)")
print({k: (float(v) if isinstance(v, (np.floating, float)) else v) for k, v in res_t.items() if k != "corr"})
print(f"Approx p-value (normality): {p_approx_t:.4f}")
print(f"MC p-value (normality): {p_mc_t:.4f}")
t(3) sample (heavy tails)
{'n': 80, 'mu': 0.015710624639228243, 'sigma': 1.6263187423964756, 'A2': 2.4402203779633993, 'A2_star': 2.4639553339834337}
Approx p-value (normality): 0.0000
MC p-value (normality): 0.0000
# Empirical CDF vs fitted normal CDF (visual goodness-of-fit)
xs_emp, p_emp = ecdf(x_t)
p_fit = norm_cdf(xs_emp, res_t["mu"], res_t["sigma"])
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=xs_emp,
y=p_emp,
mode="lines",
line_shape="hv",
name="Empirical CDF",
)
)
fig.add_trace(go.Scatter(x=xs_emp, y=p_fit, mode="lines", name="Fitted normal CDF"))
fig.update_layout(
title="Empirical CDF vs fitted normal CDF (t(3) sample)",
xaxis_title="x",
yaxis_title="CDF",
)
fig.show()
# Compare contribution profiles: normal vs heavy-tailed
_, contrib_n = anderson_darling_contributions(x_norm, norm_cdf, args=(res_norm["mu"], res_norm["sigma"]))
_, contrib_t = anderson_darling_contributions(x_t, norm_cdf, args=(res_t["mu"], res_t["sigma"]))
contrib_n_star = contrib_n * res_norm["corr"]
contrib_t_star = contrib_t * res_t["corr"]
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=np.arange(1, n + 1),
y=contrib_n_star,
mode="lines+markers",
name="Normal sample",
)
)
fig.add_trace(
go.Scatter(
x=np.arange(1, n + 1),
y=contrib_t_star,
mode="lines+markers",
name="t(3) sample",
)
)
fig.update_layout(
title="AD is tail-sensitive: contributions often spike in the extremes",
xaxis_title="Order statistic index i (tails are near 1 and n)",
yaxis_title="Contribution to A²*",
)
fig.show()
7) How to interpret the result (what it means)#
The mechanics#
Pick a null distribution F (and decide whether its parameters are fixed or estimated).
Compute the AD statistic (for normality, we used A²*).
Convert A² / A²* into a decision:
critical value: reject if A²* > c_α
p-value: reject if p < α
These steps are bundled into a single helper in the sketch below.
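As a sketch only (the helper name, default α, and simulation settings are choices made here, not a fixed API), the mechanics can be wrapped on top of the Section 4 and 5 helpers:
def ad_decide(x, alpha=0.05, n_sim=2000, seed=0):
    """Run the AD normality test end to end: statistic, MC p-value, decision."""
    res = ad_normality(x)
    null = ad_null_stats_normality(res["n"], n_sim=n_sim, seed=seed)
    res["critical"] = float(np.quantile(null, 1 - alpha))
    res["p_mc"] = float(np.mean(null >= res["A2_star"]))
    res["reject"] = bool(res["A2_star"] > res["critical"])
    return res

print(ad_decide(x_t)["reject"])  # heavy-tailed sample from Section 6: rejects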
The meaning#
If you reject: the sample is unlikely to come from the target distribution (given the test’s assumptions). This does not tell you the exact alternative; use the plots to see whether the mismatch is in the center, skew, or tails.
If you fail to reject: you don’t have strong evidence against the target distribution. It is not a proof of normality.
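To make the second point concrete, here is a small illustration (distribution, sample size, and seed are arbitrary): a t(30) sample is genuinely non-normal, yet at n = 30 the test will usually fail to reject because the deviation is tiny relative to the sampling noise.
rng_near = np.random.default_rng(5)
x_near = rng_near.standard_t(df=30, size=30)  # close to normal, but not normal
res_near = ad_normality(x_near)
print(f"A²* = {res_near['A2_star']:.4f}, approx p = {ad_pvalue_normal_approx(res_near['A2_star']):.4f}")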
A practical rule of thumb#
With large n, even tiny deviations can be statistically significant. Use AD as a diagnostic alongside effect-size thinking and domain context.
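A quick illustration with arbitrary choices of distribution and n: a t(15) sample has only mild excess kurtosis (about 0.55), but with fifty thousand observations the p-value typically collapses toward zero even though the practical deviation from normality may not matter.
rng_big = np.random.default_rng(3)
x_big = rng_big.standard_t(df=15, size=50_000)  # mildly heavy-tailed
res_big = ad_normality(x_big)
print(f"A²* = {res_big['A2_star']:.2f}, approx p = {ad_pvalue_normal_approx(res_big['A2_star']):.4f}")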
8) Common pitfalls and diagnostics#
Parameter estimation matters: the null distribution changes if you estimate parameters from the data.
Outliers: AD will often flag a single extreme point (sometimes that’s desired; sometimes it’s a data-quality problem).
Dependence (time series, spatial data): AD assumes i.i.d.; autocorrelation can create false rejections (see the simulation sketch after this list).
Discrete / rounded data: ties violate the continuous-data derivation; consider simulation/permutation approaches.
Multiple testing: if you test many groups/features, adjust your significance threshold.
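The dependence pitfall is easy to simulate. This sketch (the AR(1) model, φ = 0.6, and all sizes/seeds are illustrative choices) estimates the false-positive rate of the normality test on autocorrelated data whose marginal distribution is exactly normal; it should land well above the nominal 5%:
def ar1(n, phi, rng):
    """Stationary Gaussian AR(1): x_t = phi * x_{t-1} + e_t."""
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0] / np.sqrt(1.0 - phi**2)  # start from the stationary distribution
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

rng_dep = np.random.default_rng(1)
null_100 = ad_null_stats_normality(100, n_sim=2000, seed=2)
c05 = float(np.quantile(null_100, 0.95))  # alpha = 0.05 cutoff for i.i.d. data
rej = np.mean([ad_normality(ar1(100, 0.6, rng_dep))["A2_star"] > c05 for _ in range(300)])
print(f"False-positive rate at alpha = 0.05 with AR(1), phi = 0.6: {rej:.2f}")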
Diagnostics that pair well with AD:
CDF / P–P plots (global mismatch)
contribution plots (where in the distribution the statistic is coming from)
domain-specific residual checks (e.g., heteroskedasticity, seasonality)
9) Exercises#
Implement a CDF for another distribution (e.g., exponential with rate λ) and run anderson_darling_statistic against it.
Use ad_null_stats_normality to compute your own critical values for different sample sizes (n=20, n=200) and compare.
Build two alternatives (skewed vs heavy-tailed) and compare how the contribution plots differ.
References#
Anderson, T. W., & Darling, D. A. (1952). Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes.
Stephens, M. A. (1974/1976). Approximations and corrections for EDF statistics (commonly used for AD p-values).
SciPy: scipy.stats.anderson (returns the statistic and critical values for several distributions).